HIVE-29367: prevent Long overflows in ConvertJoinMapJoin #6237
base: master
Conversation
}
Operator<? extends OperatorDesc> parentOp = joinOp.getParentOperators().get(pos);
-   totalSize += computeOnlineDataSize(parentOp.getStatistics());
+   totalSize = StatsUtils.safeAdd(totalSize, computeOnlineDataSize(parentOp.getStatistics()));
I'm not sure it's appropriate to use safeAdd for a table size?
On the other hand, hashTableDataSizeAdjustment does that as well, so I guess it's fine.
cc @zabetak, @thomasrebele
I think it's fine. The total size here does not need to be 100% correct; it's just an estimate that influences the join decision. Might make sense to rename it to estimatedTotalSize.
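For context, the saturating-addition idea behind StatsUtils.safeAdd can be sketched roughly like this (a minimal illustration assuming clamp-to-Long.MAX_VALUE semantics, not the exact Hive implementation):

```java
// Minimal sketch of a saturating add, assuming the same contract as StatsUtils.safeAdd:
// never wrap around to a negative value, clamp to Long.MAX_VALUE instead.
static long saturatingAdd(long a, long b) {
  try {
    return Math.addExact(a, b);   // throws ArithmeticException on overflow
  } catch (ArithmeticException e) {
    return Long.MAX_VALUE;        // saturate rather than going negative
  }
}

// In the spirit of the patched line:
// totalSize = saturatingAdd(totalSize, computeOnlineDataSize(parentOp.getStatistics()));
```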
if (cs != null) {
  String colTypeLowerCase = cs.getColumnType().toLowerCase();
-   long nonNullCount = cs.getNumNulls() > 0 ? numRows - cs.getNumNulls() + 1 : numRows;
+   long nonNullCount = cs.getNumNulls() > 0 ? Math.max(1L, numRows - cs.getNumNulls() + 1) : numRows;
maybe Math.max(0L, numRows - cs.getNumNulls()) + 1
@konstantinb, do we need the same for …
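To make the clamping concrete, here is a small illustration with hypothetical, inconsistent statistics (more nulls recorded than rows), comparing the unpatched expression, the patched one, and the variant suggested above:

```java
// Hypothetical, inconsistent column statistics: more nulls than rows.
long numRows = 10L;
long numNulls = 25L;

long before  = numRows - numNulls + 1;                // -14: a negative "non-null count"
long patched = Math.max(1L, numRows - numNulls + 1);  // 1: clamped as in this PR
long variant = Math.max(0L, numRows - numNulls) + 1;  // 1: suggested form, same result here
```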



What changes were proposed in this pull request?
HIVE-29367: fix Long overflows in ConvertJoinMapJoin calculations
Why are the changes needed?
ConvertJoinMapJoin does not use StatsUtils.safeAdd()/safeMult() for all of its calculations. There are real-life scenarios where it can make a catastrophic decision to convert a join to a mapjoin after computing a negative size for the 'small' table, resulting in an OOM during query processing.
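As a rough illustration of the failure mode (hypothetical numbers, not taken from an actual query): once one of the accumulated long estimates overflows, the "small table" size turns negative and trivially passes the mapjoin size threshold.

```java
// Hypothetical: an upstream estimate has already grown close to Long.MAX_VALUE.
long totalSize = Long.MAX_VALUE - 10L;
long nextInputSize = 1_000_000L;

// Plain addition wraps around to a large negative number ...
long naive = totalSize + nextInputSize;            // negative after overflow

// ... which then looks "smaller" than any mapjoin threshold.
long mapJoinThresholdBytes = 256L * 1024 * 1024;   // illustrative threshold only
boolean convert = naive <= mapJoinThresholdBytes;  // true, even though the data is huge
```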
Does this PR introduce any user-facing change?
No
How was this patch tested?
Via unit testing and load testing on a custom Hive installation based on the 4.0.x version.
You can see the test output generated by the pre-fix code here: it clearly shows the decision to perform a mapjoin despite a very large volume of data.